Terminal Device Oriented Comparable Corpora and its Alignment - Towards Extracting Paraphrasing Patterns

Authors

  • Hiroshi Nakagawa
  • Hidetaka Masuda
  • Dai Sato
Abstract

Many terminal devices for mobile environments, such as mobile phones, have small, low-resolution screens compared with the large, high-resolution screens of personal computers. Consequently, Web pages for ordinary personal computers and for mobile phones, although written in the same language, are developed separately even when they describe the same topic or content. In this research, we collected Web news articles intended for display on personal computer screens and news articles intended for mobile terminals over a period of more than two years. We then aligned these two kinds of news articles, first at the article level and then at the sentence level. As a result, we obtained more than 88,000 pairs of aligned sentences. Next, we extracted paraphrases of the sentence-final part from this aligned corpus. The results pair the sentence-final nouns of mobile article sentences with their counterpart expressions in Web article sentences. We extract character strings as paraphrase candidates based on branching factor, frequency, and string length. The precision is 90% for the highest-ranked candidate and 80% for the top four candidates, measured on the 10 most frequently used nouns.
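
The extraction step named in the abstract (ranking character strings by branching factor, frequency, and length) can be illustrated with a minimal sketch. The abstract does not give the actual scoring formula, so everything below is an assumption made for illustration: the function name `rank_final_strings`, the definition of branching factor as the number of distinct characters immediately preceding a string, the `max_len` cutoff, and the product used as the score. The toy input is English, whereas the corpus described here consists of news articles for personal computers and mobile terminals.

```python
from collections import defaultdict


def rank_final_strings(sentences, max_len=8):
    """Rank sentence-final character strings by a hypothetical score.

    Assumption: score = frequency * branching factor * length, where the
    branching factor is the number of distinct characters seen immediately
    to the left of the string. The paper's real formula is not stated in
    the abstract.
    """
    freq = defaultdict(int)       # how often each final string occurs
    preceding = defaultdict(set)  # distinct left-neighbour characters

    for s in sentences:
        for n in range(1, min(max_len, len(s)) + 1):
            suffix = s[-n:]
            freq[suffix] += 1
            # "^" marks the start of the sentence when nothing precedes.
            left = s[-n - 1] if len(s) > n else "^"
            preceding[suffix].add(left)

    scored = []
    for suffix, f in freq.items():
        bf = len(preceding[suffix])
        scored.append((f * bf * len(suffix), suffix, f, bf))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored


if __name__ == "__main__":
    toy = ["the talks were held", "a meeting was held", "the event was held"]
    for score, suffix, f, bf in rank_final_strings(toy)[:3]:
        print(repr(suffix), score, f, bf)
```

A high branching factor suggests that a string begins at a natural boundary, because many different contexts precede it, while frequency and length favour strings that are both common and informative. How the three factors are actually weighted in the paper would have to be taken from the full text.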

Similar resources

Towards Robust Context-Sensitive Sentence Alignment for Monolingual Corpora

Aligning sentences belonging to comparable monolingual corpora has been suggested as a first step towards training text rewriting algorithms, for tasks such as summarization or paraphrasing. We present here a new monolingual sentence alignment algorithm, combining a sentence-based TF*IDF score, turned into a probability distribution using logistic regression, with a global alignment dynamic pro...

Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment

We address the text-to-text generation problem of sentence-level paraphrasing, a phenomenon distinct from and more difficult than word- or phrase-level paraphrasing. Our approach applies multiple-sequence alignment to sentences gathered from unannotated comparable corpora: it learns a set of paraphrasing patterns represented by word lattice pairs and automatically determines how to apply these p...

Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems

Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...

Unsupervised Learning of Paraphrases

Paraphrasing constitutes a cornerstone of many Natural Language Processing fields, such as monolingual text-to-text generation and automatic text summarization. Indeed, aligned monolingual corpora are likely to boost the learning process of text-to-text generation models. A paraphrase learning strategy can be defined as a two-step process: (1) identifying and extracting related sentence pairs from...

Paraphrasing the meaning of physical environment comparative examining of audience-oriented, author-oriented and text-oriented (Islamic) approaches

Discussions derived from epistemology and its sub-branches are among the most important theoretical grounds affecting the theoretical basis of art schools, in particular architectural styles. During recent decades, two approaches to epistemology have shaped each other in explaining how one can paraphrase the meaning of the physical environment. In the first approach, the audience and his knowledge are main...


Publication date: 2004